A bipartition of HIV-1 RNA genome sequences into single- and double-stranded nucleotides is possible based on the 
secondary structure model of a complete 9 kb genome. Subsequent analysis revealed that the well-known lentiviral 
property of A-accumulation is profoundly present in single-stranded domains, yet absent in double-stranded domains. 
Mutational rate analysis by means of an unrestricted model of nucleotide substitution suggests the presence of an 
evolutionary equilibrium to preserve this biased nucleotide distribution. 



Introduction 

The tendency of lentiviral open reading frames to become 
A-rich has been documented previously. 1 For instance, the sin- 
gle-stranded RNA genome of HIV-1 contains 36.2% A, 23.9% 
G, 22.2% U and 17.6% C. The increased A-content dictates 
the typical codon usage of this virus. 2,3 The apparent selection 
of A-rich codons in HIV genomes even contributes to a biased 
amino acid composition of the encoded proteins. 4 Also, HIV 
particles contain tRNAs that decode A-ending codons, suggest- 
ing a modulation of the cellular tRNA pool toward the typical 
codon preference of HIV genes. 5 These basic RNA properties are 
well conserved over time and among the different members of 
the Lentiviridae. 6,7 dCTP pool imbalance during reverse transcription
has been proposed as a cause of G→A hypermutation
of the HIV-1 genome. 8,9 dNTP pool imbalance appeared to con- 
tribute more to HIV evolution in vivo than sequence editing by 
the cellular restriction factors Apobec 3G/3F. 10 A reduction of the 
A-richness of HIV-1 polymerase sequences impaired viral DNA 
synthesis, 11 but a biological function for this typical lentiviral
A-pressure has not yet been elucidated. 12 

Recently, a secondary structure model of the complete 9 kb 
HIV-1 RNA genome at single nucleotide resolution has been 
constructed by means of a combined chemical assay of nucleo- 
tide accessibility (SHAPE, see ref. 13) and RNA folding predic- 
tion (RNAstructure, see refs. 14 and 15). The biased nucleotide 
composition of the HIV-1 RNA genome will definitely have some 
implications for the distribution of the different nucleotides over 
the structured RNA genome. Even when assuming maximal base 
pairing across the genome, the character of the possible base pairs 
(G-C, A-U and G-U and the reverse set of three) dictates that not 
every A can be paired given the 14 percentage-point surplus of A (36.2%)
over its unique pairing partner U (22.2%). This would mean that the
single-stranded regions of HIV-1 RNA will statistically have a surplus
of A and possibly G over U and particularly C. As these patterns
could constitute a distinct molecular signature of the viral
genome, we set out to further analyze the nucleotide distribution 
in the context of the HIV-1 RNA secondary structure model. 13 
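
The pairing argument above can be checked with a few lines of arithmetic. The sketch below uses only the genome-wide frequencies quoted in the text and assumes, as stated there, that A pairs exclusively with U; the variable names are illustrative.

```python
# Genome-wide nucleotide frequencies (percent) quoted in the Introduction.
freq = {"A": 36.2, "G": 23.9, "U": 22.2, "C": 17.6}

# A pairs only with U, so at most freq["U"] percentage points of A can be
# paired (and only if no U is spent on G-U pairs).
min_unpaired_a = freq["A"] - freq["U"]

# The same surplus expressed as a share of all A nucleotides.
min_unpaired_a_share = min_unpaired_a / freq["A"]

print(f"A surplus over U: {min_unpaired_a:.1f} percentage points")
print(f"Lower bound on the unpaired share of all A: {min_unpaired_a_share:.1%}")
```

Even under maximal pairing, more than a third of all A nucleotides would have to remain unpaired; the structure model reports a considerably higher fraction.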

Results 

Nucleotide composition of the structured HIV-1 RNA genome. 

The nucleotide composition differs significantly between single- 
and double-stranded regions of the HIV-1 RNA structure model 
of the NL4-3 isolate (Table 1). Of the total 9,173 nucleotides, 
59% and 41% are present in these ss and ds regions, respectively. 
As much as 79% of A nucleotides in this HIV-1 RNA genome 
participate in the ss parts. In other words, almost four of five A 
nucleotides are predicted to be unpaired in this highly structured 
RNA molecule. In contrast, 57% of U, 45% of G and only 38% 
of C are found in ss regions. These striking data indicate a dif- 
ferential nucleotide bias in the ss vs. ds domains of HIV-1 RNA. 
Apparently, the lentiviral property of A-pressure at the expense 
of C as described previously 2,4 is intensified in the ss regions
(A-rich and C-poor with 79% and 38%, respectively) but absent 
in the ds parts that show a strikingly reversed pattern (C-rich 
and A-poor with 62% and 21%, respectively). Analysis of the 
HIV-1 sequence after partition into separate reading frames and 
codon positions (GAG, POL, ENV and NEF, excluding regions 
with gene overlap) confirmed these patterns (Table S1). The
combined 5' and 3' non-coding regions displayed twice as many
paired nucleotides as the genes and a concomitant decrease in
A-content. 16,17
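
The per-nucleotide ss percentages reported above amount to a simple count over a sequence and its paired/unpaired annotation. The following is a minimal sketch, assuming the structure is available in dot-bracket notation; the function name and the toy hairpin are hypothetical and are not HIV-1 data.

```python
from collections import Counter

def ss_fraction_per_nucleotide(seq, structure):
    """Return, for each nucleotide, the fraction of its occurrences that are unpaired.

    `structure` is dot-bracket notation: '.' marks an unpaired (ss) position,
    '(' and ')' mark paired (ds) positions.
    """
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    total = Counter(seq)
    unpaired = Counter(nt for nt, s in zip(seq, structure) if s == ".")
    return {nt: unpaired[nt] / total[nt] for nt in sorted(total)}

# Toy hairpin (hypothetical): a G/C stem closed by an A-only loop.
toy_seq = "GGCCAAAAGGCC"
toy_str = "((((....))))"
print(ss_fraction_per_nucleotide(toy_seq, toy_str))
```

Applied to the NL4-3 sequence and its structure model, the same count would yield the ss shares discussed in the text (for example, about 0.79 for A).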

Analysis of the base pairs in the HIV-1 RNA secondary structure
model indicates that the most stable GC and CG base pairs
are used more frequently than AU and UA pairs (Fig. 1). The
least stable GU and UG pairs are present at an even lower frequency.
This unequal base pair composition correlates with the
slightly preferred occurrence of G and C in the ds parts of the
HIV-1 genome (Table 1: 55% and 62%, respectively).

Table 1. Biased nucleotide composition of single- and double-stranded
regions in HIV-1 NL4-3 RNA

             Number (fraction)   ss/ds
All    ss    5377 (0.59)         1.416

Figure 1. Base pair composition of the double-stranded portion of the
HIV-1 NL4-3 RNA structure.
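
A base-pair tally of the kind summarized in Fig. 1 can be obtained by matching brackets in a dot-bracket structure. This is a minimal sketch under the assumption of a pseudoknot-free structure; the function name and the toy example are hypothetical.

```python
from collections import Counter

def count_base_pairs(seq, structure):
    """Count base-pair types (5' partner written first) in a dot-bracket structure."""
    stack, pairs = [], Counter()
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)              # remember the 5' partner
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            j = stack.pop()              # index of the matching 5' partner
            pairs[seq[j] + seq[i]] += 1
    if stack:
        raise ValueError("unbalanced structure")
    return pairs

# Toy hairpin (hypothetical) containing one GC, CG, AU and GU pair.
print(count_base_pairs("GCAGAAAAUUGC", "((((....))))"))
```

Keeping GC and CG (and likewise AU/UA, GU/UG) as separate keys mirrors the orientation-specific categories named in the text.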

The structure of the NL4-3 RNA genome has been deduced 
from a combination of experimental RNA structure probing 
data and computational RNA structure prediction. In short, the 
SHAPE reactivity assay monitored the accessibility of nucleo- 
tides in the RNA structure by chemical base modification. These 
experimental data were fed into the RNA-folding software 14 to 
obtain a pairing probability value for each individual nucleotide. 13 
We analyzed the distribution of the four nucleotides for SHAPE 
reactivity (Fig. 2). The A-nucleotides show a peak in SHAPE 
reactivity around 0.8, which contrasts with the much lower values 
calculated for the other three nucleotides. The SHAPE reactivity 
of the C-nucleotides is most restricted and largely confined to the 
0.2-0.4 window. These results indicate that A nucleotides are in 
general more exposed to chemical modification than the other 
nucleotides because most As are single-stranded, whereas Cs
are best protected against the modifying agent by base pairing. 
These structure-probing results are in agreement with the biased
nucleotide composition based on the predicted HIV-1 RNA
structure model: A is overrepresented in ss regions and C is found 
preferentially in ds domains (Table 1). This points to an intimate 
relationship between the nucleotide composition of the HIV-1 
RNA genome and its structure. 
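
The per-nucleotide SHAPE histograms described above (windows of 0.2) can be sketched as follows. The encoding of missing data as `None` or a negative value is an assumption, and the function name and toy reactivities are hypothetical.

```python
import math
from collections import defaultdict

def shape_histogram(seq, reactivities, width=0.2):
    """Count nucleotides per SHAPE-reactivity window, separately for each base.

    Bin k covers reactivities in [k * width, (k + 1) * width).
    """
    hist = defaultdict(lambda: defaultdict(int))
    for nt, r in zip(seq, reactivities):
        if r is None or r < 0:           # assumed encoding for "no data"
            continue
        # The small offset guards against floating-point error at bin edges.
        bin_index = math.floor(r / width + 1e-9)
        hist[nt][bin_index] += 1
    return {nt: dict(bins) for nt, bins in hist.items()}

# Toy data (hypothetical): reactive As, protected Cs.
print(shape_histogram("AACCG", [0.85, 0.75, 0.30, 0.25, 0.10]))
```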

The connection between base composition and secondary 
structure in the RNA genome of the HIV-1 strain NL4-3 may be 
exemplary for other virus isolates. To test this, the ss/ds designa- 
tion of NL4-3 was projected onto the corresponding nucleotides 
of 448 aligned HIV-1 subtype B sequences taken from the Los 
Alamos database (year 2010, no recombinants). An ss/ds bipar- 
tition was created without affecting the individual base-to-base 
alignments. Indeed, nucleotide frequencies differ between these
two data sets much as described above for NL4-3 RNA
(Table 2). The small values for standard deviation (StD) indi- 
cate considerable conservation of the typical nucleotide composi- 
tion in the ss and ds compartments. Apparently, the property of 
A-pressure at the expense of C is prominent in portions of HIV-1 
subtype B RNAs that represent unpaired regions in NL4-3 RNA. 
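
Projecting the NL4-3 ss/ds designation onto an alignment amounts to selecting alignment columns by a mask and computing the composition of each sequence within those columns. The sketch below assumes a one-character-per-column mask ('s' or 'd') and uses the sample standard deviation; both choices, like the function names and the toy alignment, are assumptions rather than details given in the text.

```python
from statistics import mean, stdev

def composition_by_mask(aligned_seqs, mask, label):
    """Per-sequence percentages of A, U, C, G in columns whose mask equals `label`.

    Gap characters and ambiguity codes are ignored.
    """
    columns = [i for i, m in enumerate(mask) if m == label]
    rows = []
    for seq in aligned_seqs:
        bases = [seq[i] for i in columns if seq[i] in "AUCG"]
        rows.append({nt: 100.0 * bases.count(nt) / len(bases) for nt in "AUCG"})
    return rows

def summarize(rows):
    """Average and sample standard deviation of each nucleotide percentage."""
    return {nt: (mean([r[nt] for r in rows]), stdev([r[nt] for r in rows]))
            for nt in "AUCG"}

# Toy alignment (hypothetical): columns 0, 1, 4, 5 are ss in the reference.
toy_mask = "ssddssdd"
toy_alignment = ["AAGCAAGC", "AUGCA-GC", "AAGCAGGC"]
ss_rows = composition_by_mask(toy_alignment, toy_mask, "s")
print(summarize(ss_rows))
```

Because the mask is applied column-wise, the base-to-base alignment itself is left untouched, as required in the text.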

Different nucleotide substitution pattern in ss and ds 
domains of HIV-1 RNA. The strikingly different nucleotide 
composition of ss and ds RNA regions may point to different 
evolutionary rates of the nucleotides in these two domains. 
Maximum likelihood estimates of relative evolutionary rates for 
A, U, C and G nucleotides in ss and ds alignments confirmed this 
expectation (Table 3). Each off-diagonal (positive) value is the rate
at which the row nucleotide is substituted by the column nucleotide.
Each diagonal (negative) value equals minus the sum of the other
values in its row, so that every row sums to zero; its magnitude is
the total rate at which that nucleotide is substituted. From
inspection of the Qss matrix, it is obvious that the A-nucleotide shows
the lowest rate and the C-nucleotide the highest rate
of being substituted (-0.699428 and -1.776556, respectively).
The single-stranded As alter most frequently into G (0.461079),
followed by C (0.156830) and U (0.081518). G-nucleotides, in
turn, rapidly change into A (1.148242), while G→C and G→U
are relatively rare mutational events (0.114724 and 0.084045,
respectively). Likewise, the C→U substitution is more prominent
than U→C (1.007012 vs. 0.568940), C→A outscores A→C
(0.574634 vs. 0.156830) and U→A exceeds A→U (0.138854 vs.
0.081518). This nucleotide substitution pattern will lead to an
accumulation of A at the expense of C, G and U until an equilib- 
rium is reached, which is exactly the nucleotide distribution that 
has been observed in the ss regions of HIV-1 subtype B RNAs 
(Table 2). 

The Qds matrix contrasts strongly with the Qss matrix. The 
A-nucleotide is most prone to substitution (-1.499043) and G→A
is slightly less probable than A→G (0.736551 and 0.904678,
respectively), which is in line with the enhanced proportion of G 
and the equivalent diminishment of A in ds domains of HIV-1 
RNA genomes (Table 2). It should be noted that these matri- 
ces have been constructed by means of an unrestricted model of 
nucleotide substitution without any constraining condition like 
reversibility, (partial) rate equality or fixed transition/transver- 
sion ratios. In addition, the RNA genomes of different HIV-1 
isolates generated nearly identical Q matrices (Table S2). 
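
The equilibrium implied by a rate matrix is its stationary distribution, the composition pi that satisfies pi Q = 0. The sketch below approximates it for the Qss and Qds values of Table 3 by repeatedly applying pi <- pi (I + dt Q). The iteration scheme, step size and function name are illustrative assumptions; the text does not state whether or how the authors computed this quantity.

```python
NUCS = "AUCG"

# Rate matrices from Table 3 (rows: original nucleotide, columns: replacement;
# order A, U, C, G).
Q_SS = [
    [-0.699428, 0.081518, 0.156830, 0.461079],
    [0.138854, -0.899975, 0.568940, 0.192180],
    [0.574634, 1.007012, -1.776556, 0.194910],
    [1.148242, 0.084045, 0.114724, -1.347011],
]
Q_DS = [
    [-1.499043, 0.254533, 0.339833, 0.904678],
    [0.175181, -0.868417, 0.532371, 0.160865],
    [0.300107, 0.471187, -0.837239, 0.065945],
    [0.736551, 0.068547, 0.064782, -0.869880],
]

def stationary(q, dt=0.1, steps=20000):
    """Approximate pi with pi Q = 0 by iterating pi <- pi (I + dt * Q)."""
    n = len(q)
    pi = [1.0 / n] * n                   # start from a uniform composition
    for _ in range(steps):
        pi = [pi[j] + dt * sum(pi[i] * q[i][j] for i in range(n))
              for j in range(n)]
        total = sum(pi)                  # renormalize against rounding drift
        pi = [p / total for p in pi]
    return dict(zip(NUCS, pi))

for name, q in (("Qss", Q_SS), ("Qds", Q_DS)):
    pi = stationary(q)
    print(name, {nt: round(100 * p, 1) for nt, p in pi.items()})
```

The printed percentages can be compared directly with the ss and ds averages in Table 2; for Qss the equilibrium should be dominated by A, with C the rarest nucleotide.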

We report that ss and ds regions in HIV-1 RNA employ different
mutational patterns/signatures to maintain their distinct
nucleotide composition. This may relate to experimental find- 
ings that indicate that local RNA structure can influence pausing 
of the reverse transcriptase enzyme, 18,19 which may increase the
probability of misincorporation. 20 Overall, the secondary RNA 
structure seems to pose serious constraints on the nucleotide 
composition and evolution of the HIV-1 RNA genome. 

Discussion 

It is known that the RNA genomes of retroviruses do not use an 
equal portion of the four possible nucleotides, the HIV-1 genome 
being particularly A-rich (36.2%) and C-poor (17.6%). 2 We now 
evaluated these biases with respect to the ss and ds nature of the 
nucleotides in the viral RNA genome. 13 We document a strik- 
ingly different nucleotide signature for the ss and ds regions. The 
bias is put to the extreme for the ss regions (47.5% A, 21.3% U, 
19.2% G and 11.9% C) and approaches a more neutral nucleo- 
tide composition for the ds regions (19.9% A, 23.6% U, 30.7% 
G and 25.8% C). We subsequently show that distinct mutational 
patterns can be observed in these two regions that will result in 
the maintenance of the typical nucleotide composition of the ss 
and ds regions. 

The paired/unpaired status of a nucleotide in a viral RNA 
structure can have several biological effects. For instance, 
chemical and Apobec 3G-mediated nucleotide modifica- 
tion affects ss RNA more than ds RNA. 21,22 Error rates of the 
HIV-1 reverse transcriptase differ by template structure, being 
higher for ss than ds RNA. 23,24 The biology of an RNA mol- 
ecule is obviously determined by properties other than the ss/ 
ds nature. The protein-coding capacity dictates the selection 
of certain strings of nucleotides to form the required codons. 
In protein-coding sequences, which concern nearly the entire 
HIV genome, shifts in codon bias are restricted by the availabil- 
ity of cellular aminoacyl-tRNAs, overlapping reading frames 
(tat, rev and env) and overlapping regulatory sequences (e.g.,
nef overlaps with the 3' long-terminal repeat). Indeed, the viral
genome is riddled with specific sequence elements that control 
RNA splicing and many other processes such as RNA packag- 
ing into virion particles. Despite these multiple constraints, we 
disclosed a relatively simple pattern of biased nucleotide com- 
position that is highly related to the base-paired structure of 
the RNA molecule: excessive A-usage and C-restriction in the 
ss domains. 

The molecular mechanism responsible for the creation of this
typical A-rich genome configuration remains unknown, but the
new findings do refine our thoughts on the possible evolutionary
events. A priori, two possible scenarios can be envisaged
that relate to the two independent steps of evolution: mutation
and selection. The A-bias might arise through a preferred mutational
activity and/or evolutionary selection. According to the
first scenario, the generation of an A-rich genome may be caused
by an enzymatic property of the error-prone reverse transcriptase
enzyme or cellular editing activities encoded by the Apobec
functions, which may induce G→A hypermutation in HIV
sequences. 4,9,10,25 The new finding of a clustering of A nucleotides
in ss regions of the HIV-1 RNA genome does not support
these mutational scenarios as a driving force for the acquisition of
A-richness, as this would create a ubiquitously A-rich genome.
Of course, we cannot exclude a mutational activity that is selective
for ss regions.

Figure 2. SHAPE reactivity of each nucleotide (A, U, C and G) in HIV-1
NL4-3 RNA. The histograms show increasing SHAPE reactivity in windows
of 0.2 (X-axis, relative units). Frequency refers to the number of
nucleotides per SHAPE window. Note the deviant SHAPE reactivity of
the A-nucleotide.

Table 2. Nucleotide frequencies in single- and double-stranded regions
of HIV-1 subtype B RNA genomes

448 HIV-1 isolates        A        U        C        G
All     AVG           36.20    22.23    17.64    23.94
        StD            0.55     0.17     0.31     0.33
ss      AVG           47.50    21.30    11.90    19.20
        StD            0.40     0.21     0.26     0.30
ds      AVG           19.90    23.60    25.80    30.70
        StD            0.44     0.23     0.29     0.30

The alignment of 448 sequences (All) was divided in two parts by the
ss or ds designation of bases in the NL4-3 RNA structure. The individual
nucleotide compositions were used for the calculation of average and
standard deviation.

Table 3. Different patterns of nucleotide substitution for ss and ds nucleotides in HIV-1 subtype B RNA genomes

Qss            A            U            C            G
A      -0.699428     0.081518     0.156830     0.461079
U       0.138854    -0.899975     0.568940     0.192180
C       0.574634     1.007012    -1.776556     0.194910
G       1.148242     0.084045     0.114724    -1.347011

Qds            A            U            C            G
A      -1.499043     0.254533     0.339833     0.904678
U       0.175181    -0.868417     0.532371     0.160865
C       0.300107     0.471187    -0.837239     0.065945
G       0.736551     0.068547     0.064782    -0.869880

Patterns of nucleotide substitution are presented as rate matrices (Qss and Qds). A positive value of a row represents the rate of substitution of the row
nucleotide into one of the column nucleotides. A negative value on the matrix diagonal is the quantity by which the sum of the positive values in the row becomes
reduced to zero (meaning a zero rate of substitution). An unrestricted model of nucleotide substitution was used. The two alignments of ss and ds
nucleotides were analyzed in five batches of 80 sequences. The resulting matrices (Table S2) were arithmetically averaged to obtain the two "consensus"
matrices (Qss and Qds).

According to the second scenario, HIV-1 and other lentiviruses
have become A-rich (and C-poor) over evolutionary times
by selective pressure. It is currently unknown what purpose is
served by the strikingly differential base content of the HIV-1
RNA genome, but the new finding that excessive A usage is
restricted to the ss domains does support this scenario and further
specifies the typical lentiviral genome requirements. Our
favorite suggestion would be that an RNA genome with A-rich
ss domains provides a molecular signature that is recognized
during virus replication. This recognition could occur in the
context of the virus replication cycle, e.g., in selective packaging 
of this RNA molecule into virion particles amidst an excess of 
other transcripts. Alternatively, this recognition could occur in 
the context of the virus-host interplay, e.g., in recognition of the 
invading RNA by cellular factors of the innate immune system. 
The virus may have adopted a particular genome architecture 
to adapt to cellular defense mechanisms. Interestingly, gag, pol 
and env transcripts lose the ability to induce type 1 interferon 
responses upon "translational optimization" of the codons. 26 A 
more accurate description of this lentiviral RNA structure by bio- 
physical means, 3D-modeling and functional studies, e.g., bind- 
ing studies with candidate viral or cellular proteins, should help 
to unravel the underlying biological meaning of this particular 
RNA genome architecture. 

Materials and Methods 

The RNA sequence of HIV-1 isolate NL4-3, belonging to sub- 
type B, its structure, SHAPE reactivity data and base-pairing 
probabilities were taken from Watts et al. 13 The Los Alamos HIV 
database (www.hiv.lanl.gov) provided aligned genomes of HIV-1 
subtype B isolates (year 2010, 448 genomes, no recombinants). 
The NL4-3 RNA sequence, including its single-stranded (ss) or 
double-stranded (ds) designation for each nucleotide position, was 
manually added to this alignment. All nucleotides involved
in base pairing (regular Watson-Crick and G-U/U-G pairs) were 
scored as ds, and unpaired nucleotides (interhelical segments, hair- 
pin loops and internal loops, bulges) as ss. Subsequently, a biparti- 
tion was created guided by the ss or ds designation under stringent 